domain name
Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names
Wang, Hao, Wang, Yingshuo, Gan, Junang, Cheng, Yanan, Zhang, Jinshuai
Online pornography and gambling have consistently posed regulatory challenges for governments, threatening both personal assets and privacy. It is therefore imperative to research the classification of newly registered Pornographic and Gambling Domain Names (PGDN). However, scholarly investigation into this topic is limited: some previous efforts in PGDN classification pursue high accuracy using idealized sample data, while others employ up-to-date data from real-world scenarios but achieve lower classification accuracy. This paper introduces the Real-PGDN method, which covers the complete process of timely and comprehensive real-data crawling, feature extraction tolerant of missing features, precise PGDN classification, and assessment of application effects in actual scenarios. Our two-level classifier, which integrates CoSENT (BERT-based), a Multilayer Perceptron (MLP), and traditional classification algorithms, achieves 97.88% precision. The research process produced the NRD2024 dataset, which contains 20 days of continuous detection information across 6 directions for 1,500,000 newly registered domain names. Results from our case study demonstrate that the method also maintains a forecast precision of over 70% for PGDN whose usage is delayed after registration.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
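The two-level design described above can be sketched in miniature: a cheap first-level check decides the obvious cases, and only the remaining domains are escalated to a heavier second-level model. The suspect-token list, scores, and thresholds below are hypothetical stand-ins, not the paper's CoSENT/MLP pipeline.

```python
# Hypothetical two-level classifier sketch; tokens and thresholds are
# illustrative, not the features used by Real-PGDN.
SUSPECT_TOKENS = {"casino", "bet", "slot", "porn", "sex"}

def level1_score(domain: str) -> float:
    """Cheap lexical score in [0, 1] based on suspect-token hits."""
    name = domain.lower()
    hits = sum(tok in name for tok in SUSPECT_TOKENS)
    return min(1.0, hits / 2)

def level2_score(domain: str) -> float:
    """Stand-in for a heavier semantic model (CoSENT + MLP in the paper);
    here, a dummy heuristic on the registered-label length."""
    return 0.9 if len(domain.split(".")[0]) > 15 else 0.1

def is_pgdn(domain: str, threshold: float = 0.8) -> bool:
    if level1_score(domain) >= threshold:
        return True                      # level 1 is confident
    return level2_score(domain) >= 0.5   # escalate uncertain cases
```

The point of the split is cost: most newly registered domains are settled by the cheap first level, so the expensive model only sees the ambiguous tail.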
SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs
Guo, Yu, Jin, Dong, Ye, Shenghao, Chen, Shuangwu, Yang, Jian, Tan, Xiaobin
Large Language Models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance on the widely recognized Spider and BIRD benchmarks among the open-source models. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
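The "SQL syntax constraints" idea — keeping only synthesized SQL that is structurally valid against the target schema — can be approximated by asking SQLite to plan each candidate query in an in-memory database. The schema and queries below are hypothetical examples; the paper's actual constraint mechanism may differ.

```python
import sqlite3

# Hypothetical target schema for illustration.
SCHEMA = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"

def compiles_against_schema(sql: str, schema: str = SCHEMA) -> bool:
    """Return True if SQLite can plan the query against the schema.
    This catches both syntax errors and references to unknown
    tables/columns, without executing the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

A filter like this is cheap enough to run over every synthesized sample before the more expensive SQL-to-question reverse translation.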
DNS Tunneling: Threat Landscape and Improved Detection Solutions
Amirov, Novruz, Isik, Baran, Tuncer, Bilal Ihsan, Bahtiyar, Serif
Detecting DNS tunneling is a significant challenge in cybersecurity due to its capacity to hide harmful actions within DNS traffic that appears normal and legitimate. Traditional detection methods based on rule-based approaches or signature matching are often insufficient to accurately identify such covert communication channels. This paper addresses the necessity of machine learning methods for effective DNS tunneling detection. We propose a novel approach to detect DNS tunneling: by combining advanced machine learning algorithms with the analysis of various features extracted from DNS traffic, we aim to provide an accurate DNS tunneling detection model. A. About the Subject: The Domain Name System (DNS) is a hierarchical and decentralized naming system crucial for internet functionality [1]. As a core component of internet infrastructure, DNS is used in nearly every online transaction, making it a prime target for a variety of cyber threats. Due to its foundational role and widespread trust, DNS is vulnerable to several types of attacks (the threat landscape is surveyed in [2]), such as cache poisoning, amplification and DoS attacks, and phishing. These vulnerabilities offer attackers multiple avenues to disrupt or manipulate internet traffic.
- Oceania > Palau (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.68)
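The kind of per-query features such a detector consumes can be illustrated with a short sketch: tunneled payloads typically surface as long, high-entropy subdomain labels. The feature set below is a common illustrative choice, not the paper's exact feature list, and real detectors add timing and volume features on top.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def dns_features(qname: str) -> dict:
    """Illustrative lexical features for one DNS query name."""
    labels = qname.rstrip(".").split(".")
    sub = ".".join(labels[:-2])  # everything left of the registered domain
    return {
        "qname_len": len(qname),
        "label_count": len(labels),
        "max_label_len": max(len(l) for l in labels),
        "subdomain_entropy": shannon_entropy(sub) if sub else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in qname) / len(qname),
    }
```

On a name like `aGVsbG8gd29ybGQx.t.example.com` (base64-style exfiltration label, hypothetical) the subdomain entropy is far higher than on `www.example.com`, which is exactly the separation a downstream classifier exploits.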
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
Yoda, Kazuki, Kawamoto, Kazuhiko, Kera, Hiroshi
The hardness of learning a function that attains a target task relates to its input-sensitivity. For example, image classification tasks are input-insensitive as minor corruptions should not affect the classification results, whereas arithmetic and symbolic computation, which have been recently attracting interest, are highly input-sensitive as each input variable connects to the computation results. This study presents the first learning-based Quick Response (QR) code decoding and investigates learning functions of medium sensitivity. Our experiments reveal that Transformers can successfully decode QR codes, even beyond the theoretical error-correction limit, by learning the structure of embedded texts. They generalize from English-rich training data to other languages and even random strings. Moreover, we observe that the Transformer-based QR decoder focuses on data bits while ignoring error-correction bits, suggesting a decoding mechanism distinct from standard QR code readers.
- Information Technology (0.46)
- Automobiles & Trucks (0.46)
Intelligent Detection of Non-Essential IoT Traffic on the Home Gateway
Palmese, Fabio, Mandalari, Anna Maria, Haddadi, Hamed, Redondi, Alessandro Enrico Cesare
The rapid expansion of Internet of Things (IoT) devices, particularly in smart home environments, has introduced considerable security and privacy concerns due to their persistent connectivity and interaction with cloud services. Despite advancements in IoT security, effective privacy measures remain lacking, with existing solutions often relying on cloud-based threat detection that exposes sensitive data or outdated allow-lists that inadequately restrict non-essential network traffic. This work presents ML-IoTrim, a system for detecting and mitigating non-essential IoT traffic (i.e., not influencing the device operations) by analyzing network behavior at the edge, leveraging Machine Learning to classify network destinations. Our approach includes building a labeled dataset based on IoT device behavior and employing a feature-extraction pipeline to enable a binary classification of essential vs. non-essential network destinations. We test our framework in a consumer smart home setup with IoT devices from five categories, demonstrating that the model can accurately identify and block non-essential traffic, including previously unseen destinations, without relying on traditional allow-lists. We implement our solution on a home access point, showing the framework has strong potential for scalable deployment, supporting near-real-time traffic classification in large-scale IoT environments with hundreds of devices. This research advances privacy-aware traffic control in smart homes, paving the way for future developments in IoT device privacy.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
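The essential-vs-non-essential decision can be sketched with a toy per-destination feature extractor and decision rule. The feature names and thresholds are illustrative assumptions, standing in for the learned classifier ML-IoTrim trains at the edge.

```python
# Hypothetical sketch: flag a cloud destination as essential or not from
# simple behavioral features. Thresholds are illustrative, not learned.
def destination_features(flows):
    """flows: list of (bytes_sent, bytes_received) pairs observed for
    one device/destination pair over an observation window."""
    sent = sum(s for s, _ in flows)
    recv = sum(r for _, r in flows)
    total = sent + recv
    return {
        "flow_count": len(flows),
        "bytes_total": total,
        "recv_ratio": recv / total if total else 0.0,
    }

def looks_essential(feats) -> bool:
    # Destinations contacted rarely and almost one-way (telemetry-style
    # uploads) are flagged non-essential and become blocking candidates.
    if feats["flow_count"] < 3 and feats["recv_ratio"] < 0.1:
        return False
    return True
```

In the real system the decision is learned from labeled device behavior rather than hand-set, which is what lets it generalize to previously unseen destinations.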
Training Large Language Models for Advanced Typosquatting Detection
Since the early days of the commercial internet, typosquatting has exploited the simplest of human errors, mistyping a URL, to serve as a potent tool for cybercriminals. Initially observed as an opportunistic tactic, typosquatting involves registering domain names that closely match those of reputable brands, thereby redirecting users to counterfeit websites. This has evolved into a sophisticated form of cyberattack used to conduct phishing schemes, distribute malware, and harvest sensitive data. Now, with billions of domain names and TLDs in circulation, the scale and impact of typosquatting have grown exponentially, posing significant risks to individuals, businesses, and national cybersecurity infrastructure. This whitepaper explores how emerging large language model (LLM) techniques can enhance the detection of typosquatting attempts, ultimately fortifying defenses against one of the internet's most enduring cyber threats. Cybercriminals employ various domain squatting techniques to deceive users and bypass traditional security measures. These methods include, but are not limited to:
- Character Substitution: swapping similar-looking characters, such as replacing "o" with "0" in go0gle[.]com, to trick users into believing they are visiting the legitimate site.
- Omission or Addition: removing or adding a character, creating domains such as gogle[.]com
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (0.57)
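The two squatting techniques named above are mechanical enough to enumerate directly, which is how candidate lists for detectors are often generated. The homoglyph map below is a small illustrative subset, not a complete confusable-character table.

```python
import string

# Small illustrative homoglyph map; real tables are much larger.
HOMOGLYPHS = {"o": "0", "l": "1", "i": "1", "e": "3"}

def substitution_variants(domain: str) -> set:
    """Swap similar-looking characters, e.g. google.com -> g0ogle.com."""
    name, _, tld = domain.partition(".")
    return {name[:i] + HOMOGLYPHS[ch] + name[i + 1:] + "." + tld
            for i, ch in enumerate(name) if ch in HOMOGLYPHS}

def omission_variants(domain: str) -> set:
    """Drop one character, e.g. google.com -> gogle.com."""
    name, _, tld = domain.partition(".")
    return {name[:i] + name[i + 1:] + "." + tld for i in range(len(name))}

def addition_variants(domain: str) -> set:
    """Insert one character at each position."""
    name, _, tld = domain.partition(".")
    return {name[:i] + c + name[i:] + "." + tld
            for i in range(len(name) + 1) for c in string.ascii_lowercase}
```

Enumerations like these give a labeled positive class cheaply; the LLM-based detectors discussed in the whitepaper aim to also catch variants that no fixed generator anticipates.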
Comprehensive Survey on Adversarial Examples in Cybersecurity: Impacts, Challenges, and Mitigation Strategies
Deep learning (DL) has significantly transformed cybersecurity, enabling advancements in malware detection, botnet identification, intrusion detection, user authentication, and encrypted traffic analysis. However, the rise of adversarial examples (AE) poses a critical challenge to the robustness and reliability of DL-based systems. These subtle, crafted perturbations can deceive models, leading to severe consequences like misclassification and system vulnerabilities. This paper provides a comprehensive review of the impact of AE attacks on key cybersecurity applications, highlighting both their theoretical and practical implications. We systematically examine the methods used to generate adversarial examples, their specific effects across various domains, and the inherent trade-offs attackers face between efficacy and resource efficiency. Additionally, we explore recent advancements in defense mechanisms, including gradient masking, adversarial training, and detection techniques, evaluating their potential to enhance model resilience. By summarizing cutting-edge research, this study aims to bridge the gap between adversarial research and practical security applications, offering insights to fortify the adoption of DL solutions in cybersecurity.
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > Virginia (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Promising Solution (0.67)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
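The canonical generation method the survey covers, the Fast Gradient Sign Method (FGSM), computes x_adv = x + ε·sign(∇ₓL). For a plain logistic model the gradient has a closed form, so the attack fits in a few lines; the weights and inputs below are hypothetical.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Logistic model: p = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, eps):
    """x_adv = x + eps * sign(dL/dx) for binary cross-entropy loss.
    For logistic regression, dL/dx_i = (p - y) * w_i."""
    p = predict(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]
```

Each coordinate moves by exactly ε in the loss-increasing direction, which is what makes the perturbation both bounded and effective, and what defenses like adversarial training must anticipate.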
LLMs for Domain Generation Algorithm Detection
La O, Reynier Leyva, Catania, Carlos A., Parlanti, Tatiana
We perform a detailed evaluation of two important techniques: In-Context Learning (ICL) and Supervised Fine-Tuning (SFT), showing how they can improve detection. SFT increases performance by using domain-specific data, whereas ICL helps the detection model quickly adapt to new threats without requiring much retraining. We use Meta's Llama3 8B model on a custom dataset with 68 malware families and normal domains, covering several hard-to-detect schemes, including recent word-based DGAs. Results show that LLM-based methods can achieve competitive performance in DGA detection. In particular, the SFT-based LLM DGA detector outperforms state-of-the-art models using attention layers, achieving 94% accuracy with a 4% false positive rate (FPR) and excelling at detecting word-based DGA domains.
- South America > Argentina > Cuyo > Mendoza Province > Mendoza (0.04)
- North America > United States > New York (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
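The ICL setup described above amounts to prompt construction: a handful of labeled domains precede the query, and the model completes the label. The examples and prompt wording below are illustrative assumptions, not the paper's prompt.

```python
# Hypothetical few-shot ICL prompt for an LLM-based DGA detector.
# Example domains and labels are illustrative, not from the paper's dataset.
FEW_SHOT = [
    ("google.com", "legit"),
    ("kq3v9z7j1x.net", "dga"),
    ("facebook.com", "legit"),
    ("securewordupdate.biz", "dga"),  # word-based DGAs are the hard case
]

def build_icl_prompt(domain: str) -> str:
    lines = ["Classify each domain as 'dga' or 'legit'.", ""]
    for d, label in FEW_SHOT:
        lines.append(f"Domain: {d}\nLabel: {label}")
    lines.append(f"Domain: {domain}\nLabel:")
    return "\n".join(lines)
```

Because adapting to a new DGA family only requires swapping the in-context examples, ICL avoids the retraining cost that SFT incurs, at some cost in peak accuracy.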
DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Mahdaouy, Abdelkader El, Lamsiyah, Salima, Idrissi, Meryem Janati, Alami, Hamza, Yartaoui, Zakaria, Berrada, Ismail
Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and a Domain Generation Algorithms (DGA) dataset. In order to assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments source code are publicly available.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > Middle East > Morocco > Fès-Meknès Region > Fez (0.04)
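The MLM objective used for pre-training can be illustrated on URL tokens: a fraction of positions is hidden and the model learns to reconstruct them. The sketch below is deliberately simplified (it omits BERT's 80/10/10 replacement scheme) and the mask rate and token choices are the usual conventions, not values confirmed by the paper.

```python
import random

def mlm_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random fraction of tokens with [MASK]; the model is
    trained to predict the originals at masked positions only.
    Simplified: BERT's 80/10/10 replacement scheme is omitted."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)      # prediction target
        else:
            masked.append(tok)
            targets.append(None)     # not scored at this position
    return masked, targets

# Character-level tokens of a URL, as one plausible tokenization:
url_tokens = list("http://example.com/login?id=1")
```

Training on masked URLs and domain names this way is what lets the encoder pick up character-level regularities (entropy, token shapes, TLD patterns) that distinguish benign from malicious strings.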
Large Generative Graph Models
Wang, Yu, Rossi, Ryan A., Park, Namyong, Chen, Huiyuan, Ahmed, Nesreen K., Trivedi, Puja, Dernoncourt, Franck, Koutra, Danai, Derr, Tyler
Large Generative Models (LGMs) such as GPT, Stable Diffusion, Sora, and Suno are trained on a huge amount of language corpus, images, videos, and audio that are extremely diverse from numerous domains. This training paradigm over diverse well-curated data lies at the heart of generating creative and sensible content. However, all previous graph generative models (e.g., GraphRNN, MDVAE, MoFlow, GDSS, and DiGress) have been trained only on one dataset each time, which cannot replicate the revolutionary success achieved by LGMs in other fields. To remedy this crucial gap, we propose a new class of graph generative model, the Large Graph Generative Model (LGGM), that is trained on a large corpus of graphs (over 5000 graphs) from 13 different domains. We empirically demonstrate that the pre-trained LGGM has superior zero-shot generative capability to existing graph generative models. Furthermore, our pre-trained LGGM can be easily fine-tuned with graphs from target domains and demonstrates even better performance than models trained directly from scratch, behaving as a solid starting point for real-world customization. Inspired by Stable Diffusion, we further equip LGGM with the capability to generate graphs given text prompts (Text-to-Graph), such as the description of the network name and domain (e.g., "The power-1138-bus graph represents a network of buses in a power distribution system."), and network statistics (e.g., "The graph has a low average degree, suitable for modeling social media interactions."). This Text-to-Graph capability integrates the extensive world knowledge in the underlying language model, offering users fine-grained control of the generated graphs. We release the code, the model checkpoint, and the datasets at https://lggm-lg.github.io/.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Missouri > Boone County > Columbia (0.04)
- North America > United States > Michigan (0.04)
- Information Technology > Security & Privacy (0.67)
- Energy > Power Industry (0.48)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.46)